AITopics | new document

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Recursive Abstractive Processing for Retrieval in Dynamic Datasets

Chucri, Charbel, Azouz, Rami, Ott, Joachim

arXiv.org Artificial Intelligence, Oct-2-2024

Recent retrieval-augmented models enhance basic methods by building a hierarchical structure over retrieved text chunks through recursive embedding, clustering, and summarization. The most relevant information is then retrieved from both the original text and generated summaries. However, such approaches face limitations with dynamic datasets, where adding or removing documents over time complicates the updating of hierarchical representations formed through clustering. We propose a new algorithm to efficiently maintain the recursive-abstractive tree structure in dynamic datasets, without compromising performance. Additionally, we introduce a novel post-retrieval method that applies query-focused recursive abstractive processing to substantially improve context quality. Our method overcomes the limitations of other approaches by functioning as a black-box post-retrieval layer compatible with any retrieval algorithm. Both algorithms are validated through extensive experiments on real-world datasets, demonstrating their effectiveness in handling dynamic data and improving retrieval performance.
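
The recursive structure described in this abstract can be illustrated with a toy sketch. It is not the authors' algorithm: grouping adjacent chunks stands in for embedding and clustering, `toy_summarize` (a hypothetical helper that truncates text) stands in for an LLM summarizer, and word overlap stands in for a real relevance score. It only shows the shape of the idea, a tree whose leaves are original chunks and whose inner nodes are summaries, with retrieval drawing on both.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    children: list = field(default_factory=list)


def toy_summarize(texts, max_words=12):
    # Placeholder for an abstractive summarizer: keep the first words only.
    words = " ".join(texts).split()
    return " ".join(words[:max_words])


def build_tree(chunks, group_size=2):
    # Assumption: adjacent grouping replaces the clustering step.
    if not chunks:
        raise ValueError("need at least one chunk")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level = [Node(c) for c in chunks]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), group_size):
            group = level[i:i + group_size]
            parents.append(Node(toy_summarize([n.text for n in group]), group))
        level = parents
    return level[0]


def all_nodes(root):
    # Collect leaves (original chunks) and inner nodes (summaries).
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(node.children)
    return out


def retrieve(root, query, top_n=2):
    # Rank every node, summary or chunk, by word overlap with the query.
    q = set(query.lower().split())
    return sorted(
        all_nodes(root),
        key=lambda n: len(q & set(n.text.lower().split())),
        reverse=True,
    )[:top_n]
```

In this sketch, adding a document means rebuilding every level above it, which is the maintenance cost the abstract says its algorithm is designed to avoid.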

algorithm, dataset, postqfrap, (16 more...)

arXiv.org Artificial Intelligence

2410.01736

Country:

North America > Canada (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Law (0.93)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
Information Technology > Data Science > Data Mining (0.93)
(3 more...)

Write Summary Step-by-Step: A Pilot Study of Stepwise Summarization

Chen, Xiuying, Gao, Shen, Li, Mingzhe, Zhu, Qingqing, Gao, Xin, Zhang, Xiangliang

arXiv.org Artificial Intelligence, Jun-8-2024

Nowadays, neural text generation has made tremendous progress in abstractive summarization tasks. However, most of the existing summarization models take in the whole document all at once, which sometimes cannot meet the needs in practice. Practically, social text streams such as news events and tweets keep growing from time to time, and can only be fed to the summarization system step by step. Hence, in this paper, we propose the task of Stepwise Summarization, which aims to generate a new appended summary each time a new document is proposed. The appended summary should not only summarize the newly added content but also be coherent with the previous summary, to form an up-to-date complete summary. To tackle this challenge, we design an adversarial learning model, named Stepwise Summary Generator (SSG). First, SSG selectively processes the new document under the guidance of the previous summary, obtaining polished document representation. Next, SSG generates the summary considering both the previous summary and the document. Finally, a convolutional-based discriminator is employed to determine whether the newly generated summary is coherent with the previous summary. For the experiment, we extend the traditional two-step update summarization setting to a multi-step stepwise setting, and re-propose a large-scale stepwise summarization dataset based on a public story generation dataset. Extensive experiments on this dataset show that SSG achieves state-of-the-art performance in terms of both automatic metrics and human evaluations. Ablation studies demonstrate the effectiveness of each module in our framework. We also discuss the benefits and limitations of recent large language models on this task.

dataset, information, summarization, (14 more...)

arXiv.org Artificial Intelligence

2406.05361

Country:

Asia > China (0.04)
Asia > Middle East > Saudi Arabia (0.04)

Genre: Research Report (1.00)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

A Survey of Generative Information Retrieval

Kuo, Tzu-Lin, Chiu, Tzu-Wei, Lin, Tzung-Sheng, Wu, Sheng-Yang, Huang, Chao-Wei, Chen, Yun-Nung

arXiv.org Artificial Intelligence, Jun-4-2024

Generative Retrieval (GR) is an emerging paradigm in information retrieval that leverages generative models to directly map queries to relevant document identifiers (DocIDs) without the need for traditional query processing or document reranking. This survey provides a comprehensive overview of GR, highlighting key developments, indexing and retrieval strategies, and challenges. We discuss various document identifier strategies, including numerical and string-based identifiers, and explore different document representation methods. Our primary contribution lies in outlining future research directions that could profoundly impact the field: improving the quality of query generation, exploring learnable document identifiers, enhancing scalability, and integrating GR with multi-task learning frameworks. By examining state-of-the-art GR techniques and their applications, this survey aims to provide a foundational understanding of GR and inspire further innovations in this transformative approach to information retrieval. We also make the complementary materials such as paper collection publicly available at https://github.com/MiuLab/GenIR-Survey/

document identifier, identifier, retrieval, (12 more...)

arXiv.org Artificial Intelligence

2406.01197

Country:

Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > Taiwan > Taiwan Province > Taipei (0.04)

Genre:

Research Report > Promising Solution (0.34)
Overview > Innovation (0.34)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Instruction-tuned Language Models are Better Knowledge Learners

Jiang, Zhengbao, Sun, Zhiqing, Shi, Weijia, Rodriguez, Pedro, Zhou, Chunting, Neubig, Graham, Lin, Xi Victoria, Yih, Wen-tau, Iyer, Srinivasan

arXiv.org Artificial Intelligence, May-25-2024

In order for large language model (LLM)-based assistants to effectively adapt to evolving information needs, it must be possible to update their factual knowledge through continued training on new data. The standard recipe for doing so involves continued pre-training on new documents followed by instruction-tuning on question-answer (QA) pairs. However, we find that LLMs trained with this recipe struggle to answer questions, even though the perplexity of documents is minimized. We found that QA pairs are generally straightforward, while documents are more complex, weaving many factual statements together in an intricate manner. Therefore, we hypothesize that it is beneficial to expose LLMs to QA pairs before continued pre-training on documents so that the process of encoding knowledge from complex documents takes into account how this knowledge is accessed through questions. Based on this, we propose pre-instruction-tuning (PIT), a method that instruction-tunes on questions prior to training on documents. This contrasts with standard instruction-tuning, which learns how to extract knowledge after training on documents. Extensive experiments and ablation studies demonstrate that pre-instruction-tuning significantly enhances the ability of LLMs to absorb knowledge from new documents, outperforming standard instruction-tuning by 17.8%.

knowledge, language model, qa pair, (15 more...)

arXiv.org Artificial Intelligence

2402.12847

Country:

Asia > Singapore (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(8 more...)

Genre:

Research Report (0.50)
Instructional Material (0.34)

Industry:

Leisure & Entertainment (0.93)
Media > Film (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks

Guo, Jiafeng, Zhou, Changjiang, Zhang, Ruqing, Chen, Jiangui, de Rijke, Maarten, Fan, Yixing, Cheng, Xueqi

arXiv.org Artificial Intelligence, Feb-26-2024

Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers. Very recently, a pre-trained generative retrieval model for KILTs, named CorpusBrain, was proposed and reached new state-of-the-art retrieval performance. However, most existing research on KILTs, including CorpusBrain, has predominantly focused on a static document collection, overlooking the dynamic nature of real-world scenarios, where new documents are continuously being incorporated into the source corpus. To address this gap, it is crucial to explore the capability of retrieval models to effectively handle the dynamic retrieval scenario inherent in KILTs. In this work, we first introduce the continual document learning (CDL) task for KILTs and build a novel benchmark dataset named KILT++ based on the original KILT dataset for evaluation. Then, we conduct a comprehensive study over the use of pre-trained CorpusBrain on KILT++. Unlike the promising results in the stationary scenario, CorpusBrain is prone to catastrophic forgetting in the dynamic scenario, hence hampering the retrieval performance. To alleviate this issue, we propose CorpusBrain++, a continual generative pre-training framework. Empirical results demonstrate the significant effectiveness and remarkable efficiency of CorpusBrain++ in comparison to both traditional and generative IR methods.

dataset, docid, retrieval performance, (15 more...)

arXiv.org Artificial Intelligence

2402.16767

Country:

Africa > South Africa (0.69)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Beijing > Beijing (0.04)
(4 more...)

Genre: Research Report > New Finding (0.87)

Industry:

Information Technology (0.93)
Government > Regional Government > Africa Government > South Africa Government (0.47)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

DSI++: Updating Transformer Memory with New Documents

Mehta, Sanket Vaibhav, Gupta, Jai, Tay, Yi, Dehghani, Mostafa, Tran, Vinh Q., Rao, Jinfeng, Najork, Marc, Strubell, Emma, Metzler, Donald

arXiv.org Artificial Intelligence, Dec-8-2023

Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents ($+12\%$). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by $+21.1\%$ over competitive baselines for NQ and requires $6$ times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
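
The Hits@10 figure quoted in this abstract is a standard retrieval metric: the fraction of queries whose relevant document appears among the top 10 results. A minimal sketch follows, assuming one relevant document per query; the function names and the sample document IDs are hypothetical and not taken from the paper.

```python
def hits_at_k(ranked_ids, relevant_id, k=10):
    # 1.0 if the relevant document is within the top-k results, else 0.0.
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_hits_at_k(rankings, relevant_ids, k=10):
    # Average Hits@k over all queries.
    if not rankings or len(rankings) != len(relevant_ids):
        raise ValueError("need one relevant id per non-empty list of rankings")
    total = sum(hits_at_k(r, g, k) for r, g in zip(rankings, relevant_ids))
    return total / len(rankings)


# Illustrative data only: three queries, two results each.
rankings = [["d3", "d1"], ["d2", "d9"], ["d7", "d8"]]
relevant_ids = ["d1", "d2", "d5"]
print(mean_hits_at_k(rankings, relevant_ids, k=1))
print(mean_hits_at_k(rankings, relevant_ids, k=2))
```

In a continual-indexing setting, the same function would be evaluated separately on queries for previously indexed and newly indexed documents, so that forgetting shows up as a drop in the first number.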

corpus, indexing, retrieval task, (17 more...)

arXiv.org Artificial Intelligence

2212.09744

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)
Asia > India > NCT > New Delhi (0.04)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

IncDSI: Incrementally Updatable Document Retrieval

Kishore, Varsha, Wan, Chao, Lovelace, Justin, Artzi, Yoav, Weinberger, Kilian Q.

arXiv.org Artificial Intelligence, Jul-19-2023

Differentiable Search Index is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performance for document retrieval across many benchmarks. However, they have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2307.10323

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)

Understanding What is Behind Sentiment Analysis – Part 2

@machinelearnbot, Apr-20-2018, 22:35:24 GMT

Hint! Check Part I first, where we introduced a simple algorithm to analyze the sentiment of a given document. In this article we will talk about different modifications that might help us improve the performance of our classifier. To create a good classifier with the model described in Part I, we need a big and properly labelled corpus in order to compute a comprehensive word-sentiment occurrence table. The training corpus should contain statistically enough examples of each word in different contexts, so that the occurrence counts in the table give a good approximation of the words' real probabilities (frequencies). There are several techniques that aim to reduce the dimensionality of the problem to make it more manageable.
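
The Part I algorithm is not reproduced in this excerpt, so the following is only a sketch of what a word-sentiment occurrence table might look like. It assumes whitespace tokenization, add-one smoothing, and no class priors; all function names and the two-document corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_table(labelled_docs):
    # table[label][word] = number of times the word occurs under that label.
    table = defaultdict(Counter)
    for text, label in labelled_docs:
        for word in text.lower().split():
            table[label][word] += 1
    return table


def log_score(table, text, label, vocab_size):
    # Sum of smoothed log-frequencies of the words under one label.
    counts = table[label]
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in text.lower().split()
    )


def classify(table, text):
    # Pick the label under which the words are most frequent.
    vocab = {w for counts in table.values() for w in counts}
    labels = list(table)
    return max(labels, key=lambda lab: log_score(table, text, lab, len(vocab)))


# Illustrative corpus only; a usable table needs far more labelled text.
docs = [("good great film", "pos"), ("bad awful film", "neg")]
table = build_table(docs)
print(classify(table, "great good"))
print(classify(table, "awful"))
```

With so little data, most words never appear in the table at all, which is the sparsity problem the paragraph above describes and the reason the article turns to dimensionality reduction.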

artificial intelligence, natural language, negation, (19 more...)

@machinelearnbot

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.52)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.52)

Comparison of machine learning methods in email spam detection

#artificialintelligence, Feb-12-2018, 03:26:03 GMT

Unsolicited bulk emails, also known as Spam, make up approximately 60% of global email traffic. Although Spam detection technology has advanced since the first unsolicited bulk email was sent in 1978, spamming remains a time-consuming and expensive problem. This report compares the performance of three machine learning techniques for spam detection: Random Forest (RF), k-Nearest Neighbours (kNN) and Support Vector Machines (SVM). Despite the rising popularity of instant messaging technologies in recent years, email continues to be the dominant medium for digital communications for both consumer and business use. According to industry estimates (Symantec Corporation, 2016, pp 31 1), approximately 200 billion emails were sent each day in 2015.

artificial intelligence, dataset, machine learning, (15 more...)

#artificialintelligence

Industry: Information Technology > Security & Privacy (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.59)

New NHTSA Robocar regulations are a major, but positive, reversal

Robohub, Sep-15-2017, 17:30:17 GMT

NHTSA released their latest draft robocar regulations just a week after the U.S. House passed a new regulatory regime and the Senate started working on its own. The proposed regulations preempt state regulation of vehicle design, and allow companies to apply for high-volume exemptions from the standards that exist for human-driven cars. It's clear that the new approach will be quite different from the Obama-era one, much more hands-off. There are not a lot of things to like about the Trump administration but this could be one of them. The prior regulations reached 116 pages with much detail, though they were mostly listed as "voluntary."

government, regulation, vehicle, (15 more...)

Robohub

Country: North America > United States (0.55)

Industry:

Transportation > Ground > Road (1.00)
Law > Statutes (1.00)
Government > Regional Government > North America Government > United States Government (0.55)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
